Lab 2
October 13, 2025
Today’s goals: learn about the Monte Carlo simulation workflow.
Let’s start
Recall the concept of simulation
Simulation: approximate a process, and derive general results about it, by acting as if we could observe the process many times.
Monte Carlo simulation
We rely on the so-called Monte Carlo simulation, a wide class of simulation procedures based on sampling independent and identically distributed (iid) values from a process – more precisely, from the presumed true underlying probability distribution of the process – and computing some numerical outputs.
The steps of a Monte Carlo simulation are:
1. Choose the probability model assumed to generate the data.
2. Draw \(R\) independent samples of size \(n\) from that model.
3. Compute the statistic of interest on each sample.
4. Summarize the \(R\) values of the statistic (e.g. via histograms, means, variances).
In what follows, we will study the behavior of some sample statistics, such as the mean and the variance, through their sampling distributions.
Three scenarios
Suppose that \(Y_1, \ldots, Y_n\) is a sequence of iid random variables from
Case 1: \(\mathcal{N}(0,1)\) (i.e. standard normal);
Case 2: \({\rm t}_{3}\) (i.e. Student's \(t\) with 3 degrees of freedom);
Case 3: \(\mbox{U}(0,1)\) (i.e. standard continuous uniform).
Goal
Determine the distribution of the sample mean \(\overline{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\)
We can answer using theoretical results, but let’s investigate via simulation. Consider:
Number of simulations \(R=1000\);
Sample size \(n=30\).
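A minimal sketch of the generation step is given below. The seed and the \(3 \times n \times R\) array layout are assumptions, chosen to be consistent with the `apply(samples, c(1,3), var)` call used later in the lab.

```r
set.seed(123)                          # seed value is an assumption
R <- 1000                              # number of simulations
n <- 30                                # sample size
# 3-dimensional array: scenario x observation x replication
samples <- array(NA, dim = c(3, n, R))
for (r in 1:R) {
  samples[1, , r] <- rnorm(n)          # Case 1: N(0,1)
  samples[2, , r] <- rt(n, df = 3)     # Case 2: t_3
  samples[3, , r] <- runif(n)          # Case 3: U(0,1)
}
# 3 x R matrix: one row of sample means per scenario
sample_stat <- apply(samples, c(1, 3), mean)
```

This produces the `samples` array and the `sample_stat` matrix used in the plots that follow.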
Note
Above we saved the results in a 3-dimensional array. Alternatively, you can create three different matrices (each of size \(n\times R\)) or, depending on the goal of the analysis, compute the quantities of interest directly without saving the generated samples.
Sample mean computation
We can compute the sample mean via two alternative but equivalent methods:
- the apply() function;
- a for loop.

par(mfrow = c(1, 3))
hist(sample_stat[1,], nclass = 30, probability = TRUE,
xlab="y", main= "N(0,1)", cex.main = 2.5, cex.lab = 2)
hist(sample_stat[2,], nclass = 30, probability = TRUE,
xlab = "y", main = expression(t[3]), cex.main = 2.5, cex.lab = 2)
hist(sample_stat[3,], nclass = 30, probability = TRUE,
xlab = "y", main = "U(0,1)", cex.main = 2.5, cex.lab = 2)

Sample mean distribution
In the Gaussian case, the sample mean \(\overline Y\) of iid values still follows a normal distribution, precisely \(\overline Y \sim \mathcal{N}(\mu, {\sigma^{2}}/{n})\).
For other distributions, this result holds only asymptotically, as \(n \rightarrow \infty\), by the central limit theorem (CLT): when sampling does not come from a normal distribution, \(\overline{Y} \overset{\cdot}{\sim} \mathcal{N}(\mu, {\sigma^{2}}/{n})\).
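The CLT approximation requires the mean and variance of each parent distribution. For the three cases considered here these are:

\[
\mathcal{N}(0,1):\ \mu = 0,\ \sigma^2 = 1; \qquad
{\rm t}_3:\ \mu = 0,\ \sigma^2 = \tfrac{3}{3-2} = 3; \qquad
\mbox{U}(0,1):\ \mu = \tfrac{1}{2},\ \sigma^2 = \tfrac{1}{12},
\]

so the reference distributions for the sample mean are \(\mathcal{N}(0, 1/n)\), \(\mathcal{N}(0, 3/n)\), and \(\mathcal{N}(1/2, 1/(12n))\), matching the parameters passed to dnorm() in the code that follows.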
Time to work with R: It’s your turn
According to the previous comment, we want to overlay the appropriate Gaussian density on each histogram.
par (mfrow=c(1,3))
hist(sample_stat[1,], nclass = 30, probability = TRUE,
xlab="y", main= "N(0,1)", cex.main = 1.5)
curve(dnorm(x, 0, sqrt(1/n)), add = TRUE, col = "red", lwd = 2)
hist(sample_stat[2,], nclass = 30, probability = TRUE,
xlab = "y", main = expression(t[3]), cex.main = 1.5)
curve(dnorm(x, 0, sqrt((3/((3 - 2) * n)))), add = TRUE, col = "red", lwd = 2)
hist(sample_stat[3,], nclass = 30, probability = TRUE,
xlab = "y", main = "U(0,1)", cex.main = 1.5)
curve(dnorm(x, 1/2, sqrt(1/(12 * n))), add = TRUE, col = "red", lwd = 2)

Empirical Cumulative Distribution Function (ECDF)
par(mfrow = c(1,3))
plot(ecdf(sample_stat[1,]), xlab="y", main= "N(0,1)", cex.main = 1.5)
curve(pnorm(x, 0, sqrt(1/n)), add = TRUE, col = "red", lty = 2, lwd = 2)
plot(ecdf(sample_stat[2,]), xlab="y", main= expression(t[3]), cex.main = 1.5)
curve(pnorm(x, 0, sqrt((3/((3 - 2) * n)))), add = TRUE, col = "red",
lty = 2, lwd = 2)
plot(ecdf(sample_stat[3,]), xlab="y", main= "U(0,1)", cex.main = 1.5)
curve(pnorm(x, 1/2, sqrt(1/(12 * n))), add = TRUE, col = "red",
lty = 2, lwd = 2)

Sample variance under the Gaussian case
We know that \(S^{2}= \frac{1}{n-1} \sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}\) is an unbiased estimator for the variance (an estimator is said to be unbiased iff \(E(\widehat{{\theta}})= {\theta}\)).
The output provided by the function var() applied to a sample vector is the sample quantity \[s^2 = \frac{1}{n-1} \sum_{1}^{n}(y_i - \bar y)^2\]
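A quick check, on an illustrative toy vector, that var() indeed uses the \(n-1\) denominator:

```r
y <- c(1, 2, 3, 4)
# manual unbiased sample variance: divide by n - 1, not n
manual <- sum((y - mean(y))^2) / (length(y) - 1)
stopifnot(all.equal(var(y), manual))
```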
Leveraging the example above, if we assume that the true generating model is normal (case 1), we know that the distribution of the sample variance is proportional to a \(\chi^{2}\) distribution with \(n-1\) degrees of freedom:
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^{2}_{n-1}\]
Time to work with R: It’s your turn
Obtain the distribution of the sample variance via simulation, plot it via histograms/ECDFs, and overlay the corresponding density/cumulative distribution function (using the formula above).
Remember the following transformation
Let \(X\) be a r.v. with pdf \(f_X(x)\); then \(Y=g(X)\), with \(g(\cdot)\) an invertible function, has pdf \[f_Y(y)=f_X(g^{-1}(y))\Bigg | \frac{d}{dy} g^{-1}{(y)}\Bigg | \] So, in our case, let \(X=\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\); then \(S^2=Y=\frac{\sigma^2}{n-1} X\) is such that \[f_Y(y)= f_X \Big(\frac{n-1}{\sigma^2}y \Big) \frac{n-1}{\sigma^2},\]
since \(g^{-1}(y) = \frac{n-1}{\sigma^2}y\) and \(\frac{d}{dy} g^{-1}{(y)}=\frac{n-1}{\sigma^2}\).
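As a sanity check of the transformed density (the values of n and sigma are illustrative), it should integrate to 1:

```r
n <- 30
sigma <- 1
# density of S^2 obtained from the chi-square via the change of variable above
f_s2 <- function(y) (n - 1) / sigma^2 * dchisq(y * (n - 1) / sigma^2, df = n - 1)
integrate(f_s2, 0, Inf)$value   # approximately 1
```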
par (mfrow = c(1, 2))
sigma <- 1
# This slightly differs from the code implemented during the lecture
# During the lab I considered apply(samples, c(1,3), var)[1,]
sample_var <- apply(samples, c(1,3), var)
# Histogram
hist(sample_var[1,], nclass = 30, probability = TRUE,
xlab = expression(s^2), main = "Case 1: N(0,1)", cex.main = 1.5)
curve((n-1)/sigma^2 * dchisq(x * (n-1)/sigma^2, df = n - 1),
add = TRUE, col = "red", lwd = 2)
# ECDF vs CDF
plot(ecdf(sample_var[1,]), xlab = expression(s^2), main = "Case 1: N(0,1)",
cex.main = 1.5)
curve(pchisq(x * (n-1)/sigma^2, df = n - 1), add = TRUE, col = "red", lwd = 2)

Problem of statistical inference
Let \(\mathcal{F} = \{p(y;\theta), \theta \in \Theta\}\) be a parametric statistical model, with \(\Theta\) the parameter space and \(p(y; \theta)\) a collection of density (or probability) functions. Assume that \(\mathcal{F}\) is correctly specified, so that the true parameter value \(\theta^0 \in \Theta\). Some problems of statistical inference on the parameter \(\theta\) rely on the following questions:
Having observed \(y\) and assumed \(\mathcal{F}\), what can we say about where \(\theta^0\) is located in \(\Theta\)? This is an estimation problem: if we want to obtain from \(y\) a specific value for \(\theta^0\), we are performing point estimation (previous lab), while if we want to target a subset of \(\Theta\) that is likely to include \(\theta^0\), we are dealing with an interval estimation procedure (next lab).
Are the data reasonably in agreement with the hypothesis that \(\theta^0\) belongs to a subset \(\Theta^0\) of \(\Theta\)? In that case, we are dealing with a hypothesis testing problem (the lab after next).
Important
The true parameter value \(\theta^0\) is unknown, and inference on \(\theta\) (estimation, hypothesis testing) should be understood as inference on \(\theta^0\).
Unbiasedness and consistency of estimators
Let \(\hat \theta\) be a point estimator for the parameter \(\theta\). An estimator is said to be
- unbiased if \(E(\hat \theta) = \theta\);
- consistent if \(\hat \theta\) converges in probability to \(\theta\) as \(n \rightarrow \infty\).
Bias–Variance tradeoff and Mean Squared Error
There are two properties we would like an estimator to have: low bias and low variance. Since there is a tradeoff between unbiasedness and low variance, we usually seek to obtain both to some extent; ideally, we target a small Mean Squared Error (MSE)
\[\mathrm{MSE}(\hat \theta)=E \left\{\left(\hat \theta-\theta\right)^2 \right\}= \left\{E(\hat \theta ) - \theta\right\}^2+\mathrm{var}(\hat \theta)=\left(\mathrm{Bias}\right)^2+\mathrm{Variance} \]
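The decomposition above can be estimated directly from a vector of Monte Carlo replicates. The helper below is a sketch; the function name and interface are illustrative, not part of the lab code.

```r
# Estimate bias, variance and MSE of an estimator from its Monte Carlo
# replicates (helper name is an assumption, not from the lab)
mc_summary <- function(est, theta) {
  bias <- mean(est) - theta
  variance <- var(est)
  c(bias = bias, variance = variance, MSE = variance + bias^2)
}

# Example: sample mean of n = 10 values from N(2, 4)
set.seed(1)
est <- replicate(1000, mean(rnorm(10, mean = 2, sd = 2)))
mc_summary(est, theta = 2)   # variance should be close to sigma^2/n = 0.4
```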
Four estimators of the population mean
As we already know from the theory, the sample mean is an excellent estimator for the population mean. But we will consider now other estimators. Consider the case of a normal distribution, that is assume independent \(X_i \sim \mathcal{N}(\mu, \sigma^2)\), \(i = 1, \ldots, n\) and the following estimators (below \(X_{(i)}\) is the \(i\)-th order statistic):
| Estimator | Formula |
|---|---|
| Sample mean | \(\hat \mu_1=\frac{1}{n}\sum_{i=1}^{n}X_i\) |
| Sample median | \(\hat \mu_2= \begin{cases} X_{((n+1)/2)} & n \text{ odd} \\ (X_{(n/2)} + X_{(n/2+1)})/2 & n \text{ even}\end{cases}\) |
| Mid-range | \(\hat \mu_3=\frac{X_{(1)}+X_{(n)}}{2}\) |
| Trimmed mean (10%) | \(\hat \mu_4=\frac{1}{0.8n}\sum_{i=0.1n+1}^{0.9n}X_{(i)}\) |
Note
The idea of the trimmed sample mean is to compute the mean after discarding 10% of the data from the lower tail and 10% from the upper tail (the formula above covers only the case where \(0.1n\) is an integer).
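For \(n = 10\), the 10% trimmed mean drops exactly one observation per tail, so mean(x, trim = 0.1) matches the manual computation (the sample below is illustrative):

```r
set.seed(10)                     # seed value is an assumption
x <- rnorm(10)
manual <- mean(sort(x)[2:9])     # discard the smallest and largest value
stopifnot(all.equal(manual, mean(x, trim = 0.1)))
```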
Goal
Compare the four estimators above in terms of unbiasedness and efficiency.
We will use Monte Carlo method to assess these properties.
Time to work with R: It’s your turn
R <- 1000 # Number of replications
n <- 10 # Sample size
mu <- 2 # Population mean
sigma <- 2 # Population standard deviation
est <- matrix(0, R, 4)
label_est <- c("Mean", "Median", "(Min + Max)/2", "10pTrimMean")
set.seed(1234)
for (i in 1 : R) {
x <- rnorm(n, mu, sigma)
est[i, 1] <- mean(x)
est[i, 2] <- median(x)
est[i, 3] <- (min(x) + max(x))/2
est[i, 4] <- mean(x, trim = 0.1)
}
par(mfrow = c(1, 1), xaxt = "n")
boxplot(est, main="Comparison between four estimators")
par(xaxt = "s")
axis(1, 1 : 4, label_est)
abline(h = mu, lwd = 2, col = "blue")

# Estimated bias
[1] 0.012231786 0.032274611 -0.005109588 0.016567130
# Estimated variance
[1] 0.3674287 0.5111770 0.7202595 0.3870022
# Estimated MSE
[1] 0.3675783 0.5122186 0.7202856 0.3872767
Note
All the estimators appear unbiased, and the sample mean registers the lowest estimated variance, as well as the lowest estimated MSE.
Goal
Assess the consistency of the four estimators by increasing the sample size (\(n = 10, 200, 1000\)).
R <- 1000
n <- c(10, 200, 1000)
mu <- 2
sigma <- 2
est <- array(0, dim = c(R, 4, 3))
label_est <- c("Mean", "Median", "(Min + Max)/2", "10pTrimMean")
set.seed(13)
for(j in 1 : length(n)){
for (i in 1 : R) {
x <- rnorm(n[j], mu, sigma)
est[i, 1, j] <- mean(x)
est[i, 2, j] <- median(x)
est[i, 3, j] <- (max(x) + min(x))/2
est[i, 4, j] <- mean(x, trim = 0.1)
}
}
# Set up a 2 x 6 layout: first row shows methods 1-2, second row
# methods 3-4, each across the three sample sizes
par(mfrow = c(2, 6))
for(k in 1:2){ # First row: methods 1-2
for (j in 1 : length(n)) {
hist(est[,k, j], nclass = 30, probability = T, xlab = "", xlim = c(-2, 6),
main = paste(label_est[k], "- n =", n[j]))
abline(v = mu, col = "blue", lwd = 2)
}
}
for(k in 3:4){ # Second row: methods 3-4
for (j in 1 : length(n)) {
hist(est[,k, j], nclass = 30, probability = T, xlab = "", xlim = c(-2, 6),
main = paste(label_est[k], "- n =", n[j]))
abline(v = mu, col = "blue", lwd = 2)
}
}

Note
Note that for the sample mean estimator we know the theoretical variance \(\sigma^2/n\). The values obtained with \(n=10, 200, 1000\) approximate it well.
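The theoretical values can be computed directly and compared with the first column of the variance table:

```r
# Theoretical variance of the sample mean, sigma^2 / n, with sigma = 2
n <- c(10, 200, 1000)
sigma <- 2
sigma^2 / n   # 0.400 0.020 0.004
```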
# Variances
variance <- matrix(0, 3, 4)
for(j in 1:length(n)){
variance[j,] <- apply(est[,,j], 2, var)
}
# Bias
bias <- matrix(0, 3, 4)
for(j in 1:length(n)){
bias[j,] <- apply(est[,,j], 2, mean) - mu
}
# MSE
MSE <- variance + bias^2
colnames(bias) <- colnames(variance) <- colnames(MSE) <- label_est
rownames(bias) <- rownames(variance) <-
rownames(MSE) <- c("n = 10", "n = 200", "n = 1000" )
variance
             Mean    Median (Min + Max)/2 10pTrimMean
n = 10 0.377966306 0.532483793 0.7084689 0.405919655
n = 200 0.020115434 0.032069233 0.3122608 0.021081454
n = 1000 0.003986159 0.005982136 0.2469938 0.004213873
bias
             Mean    Median (Min + Max)/2 10pTrimMean
n = 10 -0.0143645259 -0.0058309337 -0.021308379 -0.0126285625
n = 200 0.0047097379 0.0105994276 -0.015492342 0.0074923632
n = 1000 -0.0004866974 -0.0005205328 0.009392445 -0.0001667324
MSE
             Mean    Median (Min + Max)/2 10pTrimMean
n = 10 0.378172645 0.532517793 0.7089229 0.406079136
n = 200 0.020137615 0.032181580 0.3125008 0.021137589
n = 1000 0.003986395 0.005982407 0.2470820 0.004213901
Note
Related to the distribution of the sample variance in the normal case, we could also rely on the biased estimator \(S^{2}_{b}= \frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}\).
Check the biased nature of \(S^{2}_b\) via MC simulation, generating \(n = 10\) iid values from a normal distribution. Compare the results with the unbiased estimator \(S^{2}\).
#Initial settings
set.seed(2)
R <- 1000
n <- 10
mu <- 0
sigma <- 1
# Save the results in two vectors:
# s2: unbiased sample variance estimates
# s2_b: biased sample variance estimates
s2 <- rep(0, R)
s2_b <- rep(0, R)
# For each replication we generate n values from
# a normal r.v. with mean mu and variance sigma^2
# and compute the two sample variance estimates
for(i in 1 : R) {
y <- rnorm(n, mu, sigma)
s2[i] <- var(y)
s2_b[i] <- var(y) * (n - 1)/n
}
s2_mean <- mean(s2)
s2_b_mean <- mean(s2_b)

# plot s2
par(mfrow = c(1, 2), oma = c(0, 0, 0, 0))
hist(s2, breaks = 50, xlab = expression(s^2), probability = TRUE,
main = expression(s^2), cex.main = 1.5)
#in red the true mean, in blue the estimated mean
abline(v = sigma^2, col = "red", lwd = 2)
abline(v = s2_mean, col = "blue", lwd = 3, lty = 3)
#plot s2 biased
hist(s2_b, breaks = 50, xlab = expression(s[b]^2), probability = TRUE,
main = expression(s[b]^2), cex.main = 1.5)
#in red the true mean, in blue the estimated mean
abline(v = sigma^2, col = "red", lwd = 3)
abline(v = s2_b_mean, col = "blue", lwd = 3, lty = 2)
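As a closing check, the simulated means s2_mean and s2_b_mean can be compared with the theoretical expectations \(E(S^2) = \sigma^2\) and \(E(S^2_b) = \frac{n-1}{n}\sigma^2\):

```r
# Theoretical expectations with n = 10, sigma = 1
n <- 10
sigma <- 1
c(unbiased = sigma^2, biased = (n - 1) / n * sigma^2)   # 1.0 and 0.9
```

The blue dashed lines in the two histograms should therefore sit near 1 and 0.9, respectively, with the second one visibly below the true value \(\sigma^2 = 1\).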